Exploratory Data Analysis and Preprocessing

With the ultimate goal of training a BERT text classifier to identify the nationality/L1 of non-native writers of English, the following project:

  1. Unifies and preprocesses data from multiple corpora
  2. Explores each corpus and L1 category quantitatively
  3. Performs Lexical Analysis on tokens across the combined corpus, and
  4. Examines limitations, design issues, and questions related to these findings

Unifying Data from Multiple Corpora

~25 hours

Corpora included:

  1. ICLE
  2. EFCAMDAT
  3. PELIC

Access Pending:

  1. ETS Non-native
  2. CEC Cambridge English Corpus

The following code extracts samples from each corpus, and unifies the labels and samples into a single dataset. Brief descriptions of each corpus are also provided.

In [1325]:
%%HTML
<script src="require.js"></script>
In [1323]:
import os
import re

#plotting
import plotly.graph_objects as go
import matplotlib.pyplot as plt
import plotly.express as px
import plotly
import plotly.io as pio
pio.renderers.default='notebook'

#data handling
import pandas as pd
import numpy as np
import xml.etree.ElementTree as ET
import bs4
from bs4 import BeautifulSoup

pd.set_option('display.max_rows', 75)
pd.set_option('display.max_columns', 10)
In [142]:
# main directories
project_dir = "/Users/paulp/Desktop/UEF/Thesis"
corpus_dir = os.path.join(project_dir, 'Data')

# relative corpus directories
ICLE_dir = os.path.join(corpus_dir, "ICLE/split_texts")
EFCAMDAT_dir = os.path.join(corpus_dir, 'EFCAMDAT')
LANG8 = os.path.join(corpus_dir, 'NAIST_LANG8/lang-8-20111007-2.0/lang-8-20111007-L1-v2.dat')
PELIC = os.path.join(corpus_dir, 'PELIC/PELIC_compiled.csv')
os.chdir(corpus_dir)

ICLE

https://uclouvain.be/en/research-institutes/ilc/cecl/icle.html

Description

The Version 2 of the International Corpus of Learner English from UC Louvain. Samples adhere closely to Atkins and Clear's (1992) corpus design criteria [ICLE]. Most samples in ICLE are argumentative essays collected from academic environments, representing a range of suggested topics.

The data available to UEF users in V2 does not represent the full range of L1/nationalities of interest. This will be addressed further.

In [3]:
files = os.scandir(ICLE_dir)
nationalities = {}
for a in files:
    b = re.split('-', a.name)[1]
    if b not in nationalities.keys():
        nationalities[b] = 1
    else:
        nationalities[b] += 1
nationalities
Out[3]:
{'GE': 281,
 'CN': 757,
 'JP': 365,
 'SW': 255,
 'PO': 350,
 'FIN': 193,
 'TR': 255,
 'RU': 250,
 'SP': 186}
In [4]:
dataset = pd.DataFrame(data = None, columns = ['Corpus','Target','Text'])
In [205]:
# fill dataframe with samples 
files = os.scandir(ICLE_dir)
for b,a in enumerate(files):
    target = re.split('-', a.name)[1]
    c = open(a)
    text = c.read()
    dataset.loc[b,'Target'] = target
    dataset.loc[b, 'Text'] = text
    dataset.loc[b, 'Corpus'] = 'ICLE'
    c.close()
In [207]:
# Remove Swedish and Polish (data too sparse)
dataset = dataset[dataset['Target'] != 'SW']
dataset = dataset[dataset['Target'] != 'PO']
dataset = dataset[dataset['Target'] != 'FIN']
len(dataset)
Out[207]:
2094

EFCAMDAT

https://philarion.mml.cam.ac.uk/

Description

This corpus is a collaboration between EF Education First and the Department of Theoretical and Applied Linguistics at the University of Cambridge. The samples were collected from English Live, EF's online language school. Samples are sortable by nationality, level, and other provided variables. As in ICLE, nationality is assumed to correlate with L1.

Notes

At first, levels 10-16 were selected for this project; based on the corpus documentation, this corresponds to B2+ CEFR levels [], which is harmonious with the ICLE corpus. However, after this initial exploration, it seemed that the levels were inflated, perhaps because they represent overall English competence rather than being distinctly reflective of writing skills. Ultimately, levels 12-16 were selected to filter out some of the lower quality samples.

To address an under-representation of Spanish language data, Spanish was also sampled from a few Latin American countries. These varieties of Spanish may well impact the model's ability to pick up on 'general' characteristics of Spanish-influenced L2 English, but for now the increase in volume and balanced representation will be assumed a benefit rather than a drawback.

In [208]:
# Process the XML file from EFCAMDAT
efcamdat = os.path.join(EFCAMDAT_dir, 'EF201403_selection1854.xml')
with open(efcamdat) as fp:
    soup = BeautifulSoup(fp, features='lxml-xml')
In [209]:
# REMINDER: add Arabic, Korean, and Latin Spanish here
efcamdat_ds = pd.DataFrame(data=None, columns = ['Corpus', 'Target', 'Text'])
nationalities = {'cn':'CN', 
                 'de':'GE', 
                 'es':'SP',  
                 'jp':'JP', 
                 'ru':'RU', 
                 'tr':'TR'}

# Build the DataFrame
for s in soup.find_all('writing'):
    level = int(s.get('level'))
    text = s.find_all('text')[0].text
    #filter out lower level texts
    if level >= 12:
        nationality = s.find_all('learner')[0].get('nationality')
        if nationality in nationalities:
            d = pd.DataFrame(data = {'Corpus': ['EFCAM'], 
                                    'Target': [nationalities[nationality]],
                                    'Text': [text]
                                    }
                            )
            efcamdat_ds = pd.concat([efcamdat_ds, d])
        else:
            pass
    else:
        pass
               
In [210]:
data = dataset.append(efcamdat_ds)
dataset['Target'] = pd.Categorical(dataset['Target'])
dataset['Corpus'] = pd.Categorical(dataset['Corpus'])
/var/folders/5h/lrwcctsx1xv_r4qlss7b9mt80000gp/T/ipykernel_25565/886358872.py:1: FutureWarning:

The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.

In [211]:
data.describe()
Out[211]:
Corpus Target Text
count 10242 10242 10242
unique 2 6 10213
top EFCAM GE \n will be done shortly\n
freq 8148 3889 6

PELIC

https://eli-data-mining-group.github.io/Pitt-ELI-Corpus/

Description

PELIC contains writing samples from students in the University of Pittsburg English Language Institute, an intensive EAP program.

Notes

Because the data is longitudinal, only one writing sample per student was selected: this to prevent the model from identifying the characteristics of individual writers rather than the target group, although the number of samples per student is relatively small in relation to the corpus size. Levels 4-5, corresponding to B1+, were selected. This may later be narrowed to level 5 to better reflect the composition of the other corpora.

In the case of PELIC, L1 (not nationality) is the variable label. Provided that the documentation of ICLE and EFCAMDAT are correct, it is reasonable to fuse nationality and L1 into a variable called 'Target' without significantly polluting the variable.

In [212]:
pelic_ds = pd.read_csv(PELIC)
In [213]:
pelic_nationality_map = {'Arabic':'AR', 
                         'Korean':'KO', 
                         'Chinese':'CN', 
                         'Japanese':'JP', 
                         'Spanish':'SP',
                         'Turkish':'TR',
                         'Russian':'RU',
                         'German':'GE'
                        }
In [214]:
# Filter by level and L1
reduced = pelic_ds.filter(items=['level_id', 'L1', 'text'])
reduced = reduced.query("level_id >= 4")

# get text and target, change target name
reduced = reduced.filter(items=['L1', 'text'])
reduced_pelic = reduced.apply(lambda row: row[reduced['L1'].isin(pelic_nationality_map.keys())])

# add corpus label and rename columns
reduced_pelic['Corpus'] = 'PELIC'
reduced_pelic = reduced_pelic.rename(columns={'L1':'Target', 'text':'Text'})
reduced_pelic['Target'] = reduced_pelic['Target'].apply(lambda row: pelic_nationality_map[row])

#append to main data
data = data.append(reduced_pelic)
/var/folders/5h/lrwcctsx1xv_r4qlss7b9mt80000gp/T/ipykernel_25565/2718693952.py:15: FutureWarning:

The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.

In [215]:
data['Corpus'].value_counts()
Out[215]:
PELIC    29142
EFCAM     8148
ICLE      2094
Name: Corpus, dtype: int64

Visualizing and Examining the Corpora

~ 50 hours

Thus far, there are three corpora in the dataset with the number of samples noted above, but more detail about the nature and distribution of the samples is needed, along with insight as to how this may influence results and inform design. The code and visualizations below show:

  1. the number of samples in each corpus corresponding to each target group
  2. the distribution of sample lengths in tokens for each target in each subcorpus

Note that the zoom feature can be used to isolate specific distributions in the visualizations for more clarity.

Design-related questions are addressed both throughout and at the end of the section.

In [216]:
corpus_colors = {'blue':'ICLE', 'green':'EFCAM', 'violet':'PELIC'}
In [1326]:
fig = px.bar(data, 
             x=data['Target'], 
             color=data['Corpus'], 
             opacity=0.8, 
             title = 'Number of Texts by Nationality Group',
             color_discrete_map = corpus_colors)
fig.update_traces(dict(marker_line_width=0)) #run this line if the visualization looks cloudy
fig.show(renderer='notebook')

Note that Arabic and Korean will have data from EFCAMDAT added in the final version. Turkish may be dropped from the project if no other sources of data are found.

In [218]:
# Calculate and Append text lengths using BERT tokenizer

from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
data['Length'] = None
In [219]:
data = data.reset_index(drop=True)
In [220]:
Length = [lambda x: len(tokenizer(x)['input_ids'])]
data['Length'] = data['Text'].apply(func = Length, result_type='expand')
Token indices sequence length is longer than the specified maximum sequence length for this model (557 > 512). Running this sequence through the model will result in indexing errors
In [1137]:
fig = px.strip(data, 
                y="Length", 
                x="Target", 
                color="Corpus", 
                color_discrete_map = corpus_colors,
                hover_data=None,
                title='Distribution of Text Lengths'
              )

fig.show()
In [230]:
# alternative visualization to the strip plot is the violin plot.
# zoom for more clarity.

fig = px.violin(data, 
                y="Length", 
                x="Target", 
                color="Corpus", 
                box = True,
                points = None,
                color_discrete_map = corpus_colors,
          hover_data=None)
fig.show()
In [1138]:
px.histogram(data,
             x='Length', 
             color = 'Corpus', 
             color_discrete_map = corpus_colors,
             range_x = [0,1500],
             opacity=1.0,
             title= 'Distribution of Text Lengths Overall'
             )

Notice the many tiny samples with length <= 50 in EFCAMDAT and PELIC. These are mostly non-informative entries that indicate the task was beyond the students' abilities or they did not have time to complete the task. These are filtered out at a threshold of 170 tokens to make the training samples more informative and efficient.

This threshold was chosen to minimize the number of excluded samples while also making sure the samples are substantial and worth training on. More implications of sample length regarding BERT models will be mentioned later and discussed more fully in the next stage of the project.

In [249]:
# trim below 170 tokens
data = data.query('Length >= 170')
px.histogram(data,
             x='Length', 
             color = 'Corpus', 
             color_discrete_map = corpus_colors,
             range_x = [170,1500],
             opacity=1.0,
             title = 'Frequency Distribution of Text Lengths'
             )
In [248]:
px.histogram(data,
             x='Length', 
             color = 'Corpus', 
             cumulative = True,
             barmode = 'overlay',
             histnorm = 'percent',
             color_discrete_map = corpus_colors,
             range_x = [170,1500],
             opacity=0.4,
             title = 'Cumulative Distribution of Text Lengths'
             )

Findings, Impacts, and Decisions

Target Representation

There are some data imbalance issues, namely that Turkish is underrepresented. One option would be to find data from a separate Turkish learner corpus for inclusion. As can be seen above, however, corpora can vary greatly in composition, quality, and length of samples. Introducing a corpus that represents only one target group might have confounding impact.

Another option is regularizing the model such that more prevalent target groups are not predicted arbitrarily: this approach 'punishes' the model for predicting German or Chinese or Arabic simply because they appear more frequently.

A third option would be to drop Turkish from the data entirely. This would have the benefit of simplifying the classification problem, which is already quite complex, although it underscores a criticism of big data approaches to low-resource languages: although these are the languages in need of more research, they tend to be left out of data-heavy studies. Although Turkish is not resource scarce, by comparison there is a lot less data at our disposal.

Sample Lengths

A principle design decision in BERT models is setting the maximum sample length in number of tokens. Although this can hypothetically be set as high or low as desired, it comes at performance costs. The standard medium-sized, pretrained BERT model has a max length of 512 tokens. If a training sample is shorter than the max length, mask tokens are passed to the model so it ignores the empty spaces at the end of the sample. If it is longer than the max length, it is truncated, and the end of the sample is lost.

Doubling the max length incurs a computational cost of (at least) a power of 2, as attention weights have to be calculated for each pair of tokens. My machine can handle max_len = 1024, although a single training epoch takes about two hours. Max length of 256 trains faster, but clips quite a bit off of longer samples, leading to massive data loss. This decision will be explored in more detail at the next stage of the project.

In [540]:
data_multi = data.set_index(['Target', 'Corpus']).sort_index()
In [541]:
data_multi.loc[('CN', 'PELIC'), :]
Out[541]:
Text Length
Target Corpus
CN PELIC In Taiwan, we have a proverb, "Far relative ca... 429
PELIC Some people said, "Not all learning takes plac... 363
PELIC There have been lots of debates on the issue t... 278
PELIC Each person has a dream; so when he realizes h... 261
PELIC In 2001, I met my good friend Bingbing while s... 187
... ... ...
PELIC When I was a child, my parents were too busy t... 245
PELIC Legalize Marijuana (Sherry, 4P, 07/23/2012)\nI... 608
PELIC Recently, I have watched a comedy starring Mar... 200
PELIC Summer vacation\n Most children like their sum... 673
PELIC Intergenerational Housing\nThe housing market ... 606

1640 rows × 2 columns

Lexical Analysis

Named Entity Recognition

Learners of different nationalities often write about the places, people, and organizations that they know: if certain tokens ('China' or 'Islam' for example) occur disproportionately in one target group, the model will likely use these as a basis of its decision making rather than looking at the structure of the text.

To test informally whether this hypothesis has any merit, we perform NER over the corpus using Stanza, and then compare the results to some measures of dispersion to gauge any correlation. Note that Stanza is trained on very clean data, and (as was shown in a previous notebook) does not perform as well on so-called 'noisy' data in which there are mistakes.

Dispersion

"Standard deviation is a useful measure when we want to see how homogeneous or heterogeneous the distribution of a word is." (Brezina, 2018, 50)

Finding the SD and then the Coefficient of Variation (CV) per for each token across the target groups, we can determine the imbalanced and frequent tokens which are most likely to make the classification task too easy for BERT.

Here I explore coefficient of variation and the dispersion of proportions. For DP, values closer to 1 indicate an uneven distribution, while near-zero values are more even. CV can take values above 1, but higher values also mean uneven distribution.

In [688]:
import stanza 
processors = {'tokenize':'ewt','ner':'conll03'}
ner = stanza.Pipeline('en', processors=processors, package='ewt')
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.4.0.json: 154kB [00:00, 18.6MB/s]
2022-08-04 10:00:34 WARNING: Language en package ewt expects mwt, which has been added
Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.4.0/models/p
Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.4.0/models/l
Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.4.0/models/d
Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.4.0/models/p
2022-08-04 10:01:23 INFO: Loading these models for language: en (English):
=======================
| Processor | Package |
-----------------------
| tokenize  | ewt     |
| mwt       | ewt     |
| pos       | ewt     |
| lemma     | ewt     |
| depparse  | ewt     |
| ner       | conll03 |
=======================

2022-08-04 10:01:23 INFO: Use device: cpu
2022-08-04 10:01:23 INFO: Loading: tokenize
2022-08-04 10:01:23 INFO: Loading: mwt
2022-08-04 10:01:23 INFO: Loading: pos
2022-08-04 10:01:24 INFO: Loading: lemma
2022-08-04 10:01:24 INFO: Loading: depparse
2022-08-04 10:01:24 INFO: Loading: ner
2022-08-04 10:01:24 INFO: Done loading processors!
In [764]:
#Generate a rough list of tokens which are parts of named entities across the corpus
#this takes a while to run

NEs = {}
for a in data['Text']:
    d = ner(a).to_dict()
    for s in d:
        if s[0]['ner'] != 'O':
            NEs[s[0]["text"]] = s[0]['ner']
In [303]:
data = data.reset_index(drop=True)
In [569]:
#freq_dist = pd.DataFrame(freq_dist)
#freq_dist.columns = ['Target', 'Corpus', 'Token']
In [ ]:
# Generate a Labeled token list 

iterables = [["Target"], data['Target'].unique()]
col = pd.MultiIndex.from_product(iterables, names=["1", "2"])

frequency_list = pd.DataFrame(data = None,
                              columns = col)
for a in data.index:
    tgt = data.loc[a, 'Target']
    ts = tokenizer.tokenize(data.loc[a,'Text'])
    for t in ts:
        if t not in frequency_list.index:
            frequency_list.loc[t] = 0
        frequency_list.loc[t, tgt] += 1
In [599]:
frequency_list['Total'] = frequency_list.sum(axis=1)
In [787]:
NE = []
for a in frequency_list.index:
    try:
        NE.append(NEs[a][2:])
    except:
        NE.append('O')
    
frequency_list['NE'] = NE
In [769]:
frequency_list.sort_values(by=['Total'], ascending=False)
Out[769]:
first Target Total NE
second GE CN JP TR RU SP AR KO
. 39561 67222 35120 18624 29502 17072 48249 39257 294607 O
, 25677 57087 27070 12838 21833 17914 39775 33887 236081 O
the 31657 50758 21134 13384 20085 16743 38137 20279 212177 O
to 21768 34854 17017 9065 15284 10733 25860 17863 152444 O
and 18306 26442 12200 7403 13965 9041 23550 13696 124603 O
... ... ... ... ... ... ... ... ... ... ...
Tier 1 0 0 0 0 0 0 0 1 O
##enbach 1 0 0 0 0 0 0 0 1 O
Alexandre 0 0 0 0 1 0 0 0 1 O
anthropology 0 0 0 0 1 0 0 0 1 O
##sket 0 0 0 0 0 0 1 0 1 O

23317 rows × 10 columns

"DP (Deviation of Proportions) is a measure proposed by Gries (2008) which compares the expected distribution of a word or phrase in different corpus parts with the actual distribution. " (Brezina, 52, 2018)

"The coefficient of variation is a standardized measure; this means that it can be compared across different words and phrases in one corpus. The closer the coefficient is to zero, the more even the distribution of the word or phrase is." (Brezina, 50, 2018)

In [789]:
def get_DP(df, col):
    exp_prop = df[col].sum(axis=0) / df[col].sum(axis=0).sum()
    obs_prop = df[col].div(df[col].sum(axis=1), axis=0)
    DP = obs_prop.sub(exp_prop).sum(axis=1).abs()/2
    return DP

def get_SD(df, col):
    cats = df[col].shape[1]
    mean = df[col].sum(axis=1) / cats
    sos = df[col].sub(mean, axis = 0).pow(2).sum(axis=1)
    SD = sos.div(cats).pow(0.5)
    return SD

def get_CV(df, col):
    sd = get_SD(df, col)
    mean = df[col].sum(axis=1) / cats
    CV = sd.div(mean)
    return CV
    
In [790]:
f2 = frequency_list.assign(DP = get_DP(frequency_list, 'Target'),
                         SD = get_SD(frequency_list, 'Target'),
                         CV = get_CV(frequency_list, 'Target')
                          ).sort_values('SD', ascending=False).reset_index()
In [1147]:
f2.to_csv(os.path.join(corpus_dir, 'frequency_dist_2.csv')) #save
f2 # frequency list with some common dispersion measures added
Out[1147]:
first index Target Total NE DP SD CV SD_log CV_exp Total_log Mask
second GE CN JP TR RU SP AR KO
0 . 39561 67222 35120 18624 29502 17072 48249 39257 294607 O 1.040834e-17 15189.949238 0.412480 9.628389 1.510560 12.593398 0
1 , 25677 57087 27070 12838 21833 17914 39775 33887 236081 O 1.040834e-17 13119.755623 0.444585 9.481874 1.559843 12.371930 0
2 the 31657 50758 21134 13384 20085 16743 38137 20279 212177 O 0.000000e+00 11865.583598 0.447384 9.381397 1.564215 12.265176 0
3 to 21768 34854 17017 9065 15284 10733 25860 17863 152444 O 0.000000e+00 7843.208256 0.411598 8.967403 1.509228 11.934553 0
4 and 18306 26442 12200 7403 13965 9041 23550 13696 124603 O 3.469447e-18 6286.282485 0.403604 8.746125 1.497211 11.732888 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
23312 Armstrong 0 0 1 0 0 0 0 0 1 O 6.938894e-18 0.330719 2.645751 -1.106486 14.094030 0.000000 0
23313 orbit 0 0 1 0 0 0 0 0 1 O 6.938894e-18 0.330719 2.645751 -1.106486 14.094030 0.000000 0
23314 Thames 0 0 0 0 1 0 0 0 1 O 6.938894e-18 0.330719 2.645751 -1.106486 14.094030 0.000000 0
23315 loosened 1 0 0 0 0 0 0 0 1 O 6.245005e-17 0.330719 2.645751 -1.106486 14.094030 0.000000 0
23316 ##sket 0 0 0 0 0 0 1 0 1 O 6.938894e-18 0.330719 2.645751 -1.106486 14.094030 0.000000 0

23317 rows × 18 columns

In [797]:
# transform the statistics so they are clearer to visualize

f2['SD_log'] = f2['SD'].transform(np.log)
f2['CV_exp'] = f2['CV'].transform(np.exp)
f2['Total_log'] = f2['Total'].transform(np.log)
In [1315]:
px.scatter(f2,
        x = 'Total_log',
           y= 'DP',
       color='NE',
           hover_data=['index'],
           
            title= 'Log of Absolute Frequency')

The most informative visualization plots the log of the total token frequency against the exponent of the coefficient of variation. These transformations exaggerate the high-risk values, spreading them out to be more visible.

In [1317]:
fig = px.scatter(f2,
            x = 'Total_log',
            y= 'CV_exp',
            color='NE',
            opacity=0.6,
            hover_data=['index'],
            title= 'Transform of CV and absolute frequency')
fig.show()

Notice how most of the country names and language names appear in the sparse area on the upper left. There is also some suggestion of topical imbalance, with words like: 'credit', 'card', 'debt', 'repay', and 'betting' appearing in this area as well. So named entities are not necessarily the only tokens which will have to be masked.

At first, I tried density-based clustering to try to separate the sparse values at the upper right from the denser values. Although this delivered intuitive results, another algorithm will be used to make a more principled selection of tokens to be masked.

Use sklearn's nearest neighbors algorithm to find the epsilon and

https://www.section.io/engineering-education/dbscan-clustering-in-python/#:~:text=DBSCAN%20is%20a%20popular%20density,number%20of%20clusters%20required%20prior.

In [958]:
from sklearn.neighbors import NearestNeighbors
mdl = f2.loc[:, ['CV_exp', 'Total_log']]
nn = NearestNeighbors(n_neighbors=2)
nbrs=nn.fit(mdl) # fitting the data to the object
distances,indices=nbrs.kneighbors(mdl)

distances = np.sort(distances, axis = 0)
distances = distances[:, 1] 

# plotting the distances
px.scatter(distances)
In [1022]:
from sklearn.cluster import DBSCAN
# cluster the data into five clusters
dbscan = DBSCAN(eps = 0.80, min_samples = 100).fit(mdl) # fitting the model
#dbscan = DBSCAN(eps = 1.3, min_samples = 100).fit(mdl) # fitting the model
labels = dbscan.labels_ # getting the labels
In [1141]:
f2['Mask'] = labels
f2.loc[f2['CV_exp']<=5.0, 'Mask'] = 0 # keep the high-frequency, low CV items out of the filter
f2.loc[f2['Total_log']<3.70, 'Mask'] = 0 # get low frequency items out of the filter
f2['Mask'] = pd.Categorical(f2['Mask']) # change from float to categorical type

freq_thresh = 3.8
cv_thresh = 3.7
x_range = np.array(np.arange(freq_thresh, max(f2['Total_log']), 0.01))
b = 0.5
filter_func = 1/(b*(x_range-freq_thresh))+cv_thresh

fig = px.scatter(f2,
            x = 'Total_log',
            y= 'CV_exp',
            color='Mask',
            opacity=0.6,
            hover_data=['index'],
            title= 'Masking ',
            )
fig.add_scatter(x = x_range, y = filter_func, mode = 'lines')
fig.update_layout(yaxis_range=[0, max(f2['CV_exp'])+1.0], 
                  xaxis_range = [0, max(f2['Total_log'])+1.0])
#fig.update_traces(legendgrouptitle=dict('text','Filter'))
fig.show()
/var/folders/5h/lrwcctsx1xv_r4qlss7b9mt80000gp/T/ipykernel_25565/2277061660.py:10: RuntimeWarning:

divide by zero encountered in true_divide

The points in red would be masked in the dbscan approach. This looks intuitive but does not follow any principled logic. I will first clarify a problem with the dbscan mask as is, and then propose a way to optimize the position and shape of the green line to form the mask boundary.

In [1025]:
f2.style.set_sticky(axis='columns')#.hide(axis=”index”)
f2.loc[f2['Mask'] == -1].sort_values(by='Total_log', ascending=False)
Out[1025]:
first index Target Total NE DP SD CV SD_log CV_exp Total_log Mask
second GE CN JP TR RU SP AR KO
31 > 55 3584 28 35 17 21 50 151 3941 O 4.163336e-17 1169.117502 2.373240 7.064004 10.732111 8.279190 -1
30 < 54 3573 24 26 18 9 33 97 3834 O 6.245005e-17 1169.612943 2.440507 7.064428 11.478858 8.251664 -1
46 R 136 2735 44 28 59 57 320 74 3453 PER 4.857226e-17 875.046133 2.027330 6.774277 7.593781 8.146999 -1
35 Hong 0 3355 60 0 0 0 2 5 3422 LOC 5.551115e-17 1106.565266 2.586944 7.009016 13.289097 8.137980 -1
37 Kong 1 3316 1 0 0 0 0 0 3318 LOC 6.938894e-18 1096.569509 2.643929 6.999942 14.068368 8.107117 -1
45 credit 37 2820 17 6 19 11 18 22 2950 O 2.775558e-17 926.523846 2.512607 6.831440 12.337051 7.989560 -1
65 smoking 12 2067 29 31 12 22 410 85 2668 O 4.163336e-17 667.140353 2.000421 6.503000 7.392166 7.889084 -1
72 smoke 18 1901 69 16 16 42 323 62 2447 O 4.163336e-17 610.415727 1.995638 6.414140 7.356894 7.802618 -1
105 restaurants 70 1497 129 24 49 14 217 202 2202 O 6.938894e-18 467.375050 1.698002 6.147132 5.463021 7.697121 -1
70 Japanese 14 83 1903 2 13 2 36 67 2120 MISC 2.775558e-17 619.743495 2.338655 6.429306 10.367280 7.659171 -1
94 card 68 1670 76 12 19 11 77 65 1998 O 5.551115e-17 537.480174 2.152073 6.286892 8.602671 7.599902 -1
95 Japan 15 79 1657 9 15 13 35 91 1914 LOC 0.000000e+00 536.676287 2.243161 6.285395 9.423071 7.556951 -1
90 Korea 1 43 69 3 5 3 16 1660 1800 LOC 1.110223e-16 542.852420 2.412677 6.296837 11.163811 7.495542 -1
104 China 19 1491 35 39 9 16 42 66 1717 LOC 1.387779e-17 482.718069 2.249123 6.179433 9.479421 7.448334 -1
106 cards 86 1435 43 18 49 19 9 24 1683 O 4.163336e-17 463.431747 2.202884 6.138659 9.051080 7.428333 -1
135 waste 98 1224 91 22 53 34 101 54 1677 O 4.857226e-17 384.396585 1.833734 5.951675 6.257210 7.424762 -1
85 \ 0 2 0 2 0 0 1672 0 1676 O 6.938894e-18 552.773688 2.638538 6.314949 13.992730 7.424165 -1
137 abortion 3 1187 16 198 30 51 22 22 1529 O 1.387779e-17 380.899737 1.992935 5.942536 7.337038 7.332369 -1
128 ##ber 47 1257 14 11 21 22 21 54 1447 O 6.245005e-17 406.990613 2.250121 6.008790 9.488883 7.277248 -1
110 cafe 4 1383 4 7 24 1 4 3 1430 O 0.000000e+00 455.214167 2.546653 6.120768 12.764306 7.265430 -1
146 recycling 15 1105 51 0 8 17 36 23 1255 O 4.857226e-17 358.669569 2.286340 5.882402 9.838860 7.134891 -1
204 * 38 787 23 9 37 10 252 77 1233 O 2.775558e-17 250.571565 1.625768 5.523745 5.082323 7.117206 -1
165 Chinese 10 1016 69 18 5 5 41 49 1213 MISC 1.387779e-17 327.424013 2.159433 5.791256 8.666222 7.100852 -1
198 disadvantage 43 822 55 45 30 31 106 30 1162 O 6.938894e-18 256.855869 1.768371 5.548515 5.861297 7.057898 -1
141 Saudi 4 7 1 0 0 5 1125 13 1155 LOC 1.387779e-17 370.663573 2.567367 5.915295 13.031464 7.051856 -1
155 professionals 11 1048 7 1 29 9 12 3 1120 O 2.775558e-17 343.283775 2.452027 5.838557 11.611860 7.021084 -1
163 Korean 0 44 45 0 0 0 3 1009 1101 MISC 0.000000e+00 329.872300 2.396892 5.798706 10.988973 7.003974 -1
238 Germany 682 23 37 6 23 10 35 59 875 LOC 4.163336e-17 216.993627 1.983942 5.379868 7.271348 6.774224 -1
187 Arabia 5 7 1 0 0 5 804 6 828 LOC 4.857226e-17 264.776793 2.558230 5.578887 12.912940 6.719013 -1
200 banning 8 780 1 0 2 0 12 2 805 O 2.081668e-17 256.810698 2.552156 5.548339 12.834746 6.690842 -1
240 import 26 665 12 5 3 6 15 21 753 O 3.469447e-17 215.900693 2.293766 5.374819 9.912194 6.624065 -1
233 debt 2 673 1 1 5 3 7 8 700 O 2.081668e-17 221.311997 2.529280 5.399573 12.544470 6.551080 -1
382 survey 58 443 31 4 46 20 15 59 676 O 6.938894e-18 136.789071 1.618806 4.918440 5.047058 6.516193 -1
287 ##land 5 561 19 3 14 11 25 24 662 O 2.081668e-17 180.919008 2.186332 5.198049 8.902503 6.495266 -1
292 Main 27 545 11 5 6 7 15 18 634 MISC 0.000000e+00 176.170904 2.222977 5.171455 9.234779 6.452049 -1
271 Taiwan 2 572 9 0 0 2 2 4 591 ORG 1.387779e-17 188.292616 2.548800 5.237997 12.791747 6.381816 -1
378 ban 27 432 27 3 14 5 44 2 554 O 0.000000e+00 137.789468 1.989740 4.925727 7.313629 6.317165 -1
392 Al 23 20 10 15 20 23 419 9 539 LOC 2.775558e-17 132.999001 1.974011 4.890342 7.199497 6.289716 -1
429 ##id 30 23 12 23 13 13 390 30 534 O 6.938894e-18 122.370084 1.833260 4.807050 6.254240 6.280396 -1
317 railway 14 497 3 1 7 0 0 0 522 O 3.469447e-17 163.249617 2.501910 5.095280 12.205783 6.257668 -1
326 mainland 1 486 1 0 0 1 0 0 489 O 3.469447e-17 160.588323 2.627212 5.078844 13.835141 6.192362 -1
365 Russia 12 4 2 3 437 3 15 10 486 LOC 0.000000e+00 142.281192 2.342077 4.957805 10.402823 6.186209 -1
421 parks 14 385 11 0 10 5 23 21 469 O 2.081668e-17 123.562674 2.107679 4.816749 8.229118 6.150603 -1
646 Government 6 261 31 7 5 31 31 19 391 O 5.551115e-17 80.904940 1.655344 4.393275 5.234881 5.968708 -1
464 scheme 2 352 4 1 12 4 0 3 378 O 6.938894e-18 115.235357 2.438844 4.746977 11.459780 5.934894 -1
432 Seoul 0 2 6 0 0 0 0 369 377 LOC 0.000000e+00 121.673166 2.581924 4.801338 13.222553 5.932245 -1
645 banned 27 259 10 12 5 1 43 17 374 O 2.081668e-17 81.189208 1.736668 4.396782 5.678389 5.924256 -1
435 betting 0 367 0 0 0 1 0 0 368 O 9.020562e-17 121.327037 2.637544 4.798490 13.978833 5.908083 -1
501 debts 20 319 1 1 11 2 2 6 362 O 0.000000e+00 103.650555 2.290620 4.641025 9.881062 5.891644 -1
462 repay 7 351 0 0 1 0 1 1 361 O 3.469447e-17 115.630270 2.562444 4.750398 12.967467 5.888878 -1
566 Turkey 6 6 5 285 8 3 11 4 328 LOC 1.387779e-17 92.252371 2.250058 4.524528 9.488285 5.793014 -1
556 Russian 13 8 1 5 291 3 4 1 326 MISC 0.000000e+00 94.658267 2.322902 4.550273 10.205250 5.786897 -1
652 Cy 4 246 3 0 1 0 1 44 299 O 6.245005e-17 80.081111 2.142638 4.383040 8.521892 5.700444 -1
727 ##hand 17 226 9 3 9 1 28 6 299 O 4.163336e-17 71.747713 1.919671 4.273156 6.818717 5.700444 -1
725 gambling 2 226 12 4 1 28 18 6 297 O 2.775558e-17 71.901734 1.936747 4.275300 6.936151 5.693732 -1
732 Spain 15 1 8 1 9 224 17 19 294 LOC 5.551115e-17 71.057635 1.933541 4.263491 6.913950 5.683580 -1
692 labour 13 235 1 7 24 13 0 0 293 O 3.469447e-17 75.380597 2.058173 4.322550 7.831651 5.680173 -1
900 breathing 16 190 27 9 8 3 14 14 281 O 2.775558e-17 58.907634 1.677086 4.075971 5.349942 5.638355 -1
797 respiratory 1 201 2 4 0 0 61 3 272 O 4.857226e-17 66.053009 1.942736 4.190458 6.977813 5.605802 -1
893 pregnancy 1 189 13 27 4 6 19 12 271 O 1.387779e-17 59.157496 1.746347 4.080203 5.733618 5.602119 -1
715 bars 12 226 6 1 3 3 13 7 271 O 3.469447e-17 72.726263 2.146901 4.286703 8.558293 5.602119 -1
922 MP 1 169 2 0 2 0 8 83 265 O 0.000000e+00 57.819628 1.745498 4.057328 5.728755 5.579730 -1
833 affairs 11 199 6 8 24 6 2 7 263 O 2.775558e-17 63.088108 1.919030 4.144532 6.814345 5.572154 -1
631 catering 2 253 0 0 3 0 3 0 261 O 6.938894e-18 83.303568 2.553366 4.422491 12.850286 5.564520 -1
1018 lung 5 161 10 5 0 3 57 8 249 O 2.775558e-17 52.013069 1.671103 3.951495 5.318028 5.517453 -1
825 cheating 3 21 1 199 3 7 7 7 248 O 6.938894e-18 63.757353 2.056689 4.155085 7.820033 5.513429 -1
701 cellular 0 3 226 0 1 1 2 9 242 O 2.775558e-17 74.036731 2.447495 4.304561 11.559357 5.488938 -1
831 graduates 2 197 4 13 5 10 4 6 241 O 2.081668e-17 63.161376 2.096643 4.145693 8.138803 5.484797 -1
994 residents 23 169 6 3 9 0 9 18 237 O 1.387779e-17 53.150582 1.794112 3.973129 6.014135 5.468060 -1
1049 shortage 1 162 22 7 11 7 15 10 235 O 3.469447e-17 50.460226 1.717795 3.921185 5.572228 5.459586 -1
703 Muslims 3 0 0 7 0 0 225 0 235 MISC 6.938894e-18 73.976242 2.518340 4.303744 12.407984 5.459586 -1
1054 conducted 17 161 11 6 19 6 5 7 232 O 2.081668e-17 50.137311 1.728873 3.914765 5.634299 5.446737 -1
908 @ 29 4 182 4 6 3 1 0 229 O 6.938894e-18 58.617270 2.047765 4.071029 7.750558 5.433722 -1
696 ##iya 0 0 0 0 0 1 226 0 227 O 6.938894e-18 74.695946 2.632456 4.313426 13.907889 5.424950 -1
699 ##dh 0 0 0 0 0 0 225 0 225 O 6.938894e-18 74.411756 2.645751 4.309614 14.094030 5.416100 -1
912 Arabic 1 3 9 2 5 8 182 13 223 MISC 3.469447e-17 58.374732 2.094161 4.066883 8.118625 5.407172 -1
972 raw 5 172 14 2 7 4 5 13 222 O 0.000000e+00 54.666603 1.969968 4.001253 7.170445 5.402677 -1
916 Kim 2 5 25 0 4 0 3 180 219 PER 0.000000e+00 58.184915 2.125476 4.063626 8.376887 5.389072 -1
856 ##fill 14 190 6 0 2 0 2 1 215 O 2.775558e-17 61.809056 2.299872 4.124050 9.972904 5.370638 -1
979 employers 13 169 4 4 6 5 9 3 213 O 4.163336e-17 53.900226 2.024422 3.987135 7.571730 5.361292 -1
1028 EU 162 0 10 4 12 19 2 1 210 ORG 2.775558e-17 51.669019 1.968344 3.944858 7.158809 5.347108 -1
1126 link 11 150 9 3 6 9 13 3 204 O 4.857226e-17 47.175205 1.850008 3.853868 6.359871 5.318120 -1
1186 junior 6 28 142 3 5 1 1 17 203 O 0.000000e+00 44.941455 1.771092 3.805361 5.877267 5.313206 -1
1137 Tokyo 0 52 139 0 0 0 1 2 194 LOC 4.163336e-17 46.536948 1.919049 3.840247 6.814477 5.267858 -1
1003 ##smo 1 162 1 0 3 0 21 4 192 O 4.163336e-17 52.564246 2.190177 3.962036 8.936794 5.257495 -1
1269 ##ah 10 18 1 7 1 3 133 17 190 O 2.081668e-17 41.757484 1.758210 3.731879 5.802042 5.247024 -1
1234 crops 1 21 136 2 3 2 8 16 189 O 2.081668e-17 43.025973 1.821205 3.761804 6.179301 5.241747 -1
909 Koreans 0 5 2 0 0 0 0 178 185 MISC 5.551115e-17 58.560732 2.532356 4.070064 12.583117 5.220356 -1
967 Low 1 168 0 0 1 1 10 4 185 O 6.245005e-17 54.846234 2.371729 4.004534 10.715904 5.220356 -1
1192 Rama 1 1 0 47 0 0 134 0 183 O 6.938894e-18 44.694624 1.953863 3.799853 7.055894 5.209486 -1
920 Moscow 2 1 0 0 176 0 2 0 181 LOC 2.081668e-17 57.976154 2.562482 4.060032 12.967963 5.198497 -1
1188 din 10 141 3 0 6 7 10 3 180 O 3.469447e-17 44.908240 1.995922 3.804621 7.358983 5.192957 -1
1389 Happiness 0 45 1 4 1 8 117 4 180 O 2.081668e-17 38.343839 1.704171 3.646594 5.496825 5.192957 -1
1305 feminist 3 1 0 6 123 45 0 0 178 O 3.469447e-17 40.680923 1.828356 3.705759 6.223647 5.181784 -1
1410 Boston 2 115 39 0 0 0 2 18 176 O 5.551115e-17 37.426595 1.701209 3.622382 5.480569 5.170484 -1
1251 ##ncies 14 133 2 3 6 6 5 0 169 O 6.938894e-18 42.463035 2.010085 3.748634 7.463948 5.129899 -1
1307 ##dan 3 2 4 27 3 0 126 3 168 O 1.387779e-17 40.503086 1.928718 3.701378 6.880686 5.123964 -1
1432 Euro 117 10 13 1 7 7 2 3 160 O 4.163336e-17 36.861226 1.843061 3.607160 6.315843 5.075174 -1
1245 adverse 2 132 5 2 2 0 11 2 156 O 1.387779e-17 42.638011 2.186565 3.752746 8.904570 5.049856 -1
1115 Credit 6 145 1 0 1 1 0 1 155 O 5.551115e-17 47.515622 2.452419 3.861059 11.616415 5.043425 -1
1478 boost 15 113 4 2 8 2 7 4 155 O 1.387779e-17 35.608768 1.837872 3.572592 6.283153 5.043425 -1
1524 ##backs 5 110 3 5 10 10 6 2 151 O 1.387779e-17 34.548652 1.830392 3.542369 6.236332 5.017280 -1
1425 unwanted 0 116 1 16 3 3 11 1 151 O 5.551115e-17 37.085838 1.964813 3.613235 7.133576 5.017280 -1
1218 Professional 1 134 0 2 4 2 3 0 146 O 4.857226e-17 43.768567 2.398278 3.778916 11.004207 4.983607 -1
1488 cheat 3 12 2 111 3 6 4 4 145 O 6.938894e-18 35.225834 1.943494 3.561780 6.983109 4.976734 -1
1518 ##gna 3 109 1 5 11 7 6 0 142 O 6.938894e-18 34.643722 1.951759 3.545117 7.041062 4.955827 -1
1275 Muslim 3 0 1 2 0 4 127 0 137 MISC 6.938894e-18 41.552489 2.426423 3.726957 11.318321 4.919981 -1
1452 [UNK] 0 0 0 25 1 0 111 0 137 O 6.938894e-18 36.402052 2.125667 3.594625 8.378486 4.919981 -1
1423 Ban 1 115 4 1 0 3 10 2 136 PER 4.163336e-17 37.155080 2.185593 3.615101 8.895922 4.912655 -1
1375 constructing 4 119 2 0 7 3 0 1 136 O 2.081668e-17 38.613469 2.271381 3.653601 9.692773 4.912655 -1
1673 SA 3 97 9 0 0 0 22 4 135 O 4.163336e-17 31.066210 1.840961 3.436121 6.302589 4.905275 -1
1664 – 0 10 8 2 1 5 10 99 135 O 5.551115e-17 31.258749 1.852370 3.442299 6.374912 4.905275 -1
1420 prospects 3 115 5 1 8 1 1 1 135 O 2.775558e-17 37.163280 2.202268 3.615321 9.045509 4.905275 -1
1687 toxic 7 98 7 1 9 2 7 4 135 O 1.387779e-17 30.771080 1.823471 3.426575 6.193321 4.905275 -1
1284 Taiwanese 0 126 1 0 0 0 0 6 133 MISC 6.938894e-18 41.385195 2.489335 3.722923 12.053259 4.890349 -1
1678 feminism 0 1 2 2 93 34 0 0 132 O 6.938894e-18 30.894983 1.872423 3.430594 6.504038 4.882802 -1
1445 Turkish 14 0 0 112 0 0 4 0 130 MISC 6.245005e-17 36.475163 2.244625 3.596632 9.436880 4.867534 -1
1300 GM 0 4 124 0 0 0 1 0 129 ORG 6.245005e-17 40.793497 2.529829 3.708523 12.551363 4.859812 -1
1435 imported 4 113 4 0 0 1 4 2 128 O 6.245005e-17 36.698093 2.293631 3.602725 9.910857 4.852030 -1
1306 Shen 0 123 0 2 0 0 0 1 126 O 2.081668e-17 40.542416 2.574122 3.702349 13.119788 4.836282 -1
1387 ##turn 2 117 0 0 6 0 0 0 125 O 9.020562e-17 38.366449 2.455453 3.647183 11.651707 4.828314 -1
1475 Berlin 110 2 1 1 5 2 1 2 124 LOC 1.387779e-17 35.738635 2.305718 3.576232 10.031382 4.820282 -1
1809 ##crow 11 91 1 2 8 4 3 2 122 O 2.775558e-17 28.808636 1.889091 3.360675 6.613354 4.804021 -1
1361 ##dah 1 0 0 0 0 0 118 0 119 O 6.245005e-17 38.978961 2.620434 3.663022 13.741691 4.779123 -1
1525 Valley 2 106 0 0 0 7 4 0 119 O 6.245005e-17 34.523316 2.320895 3.541635 10.184788 4.779123 -1
1817 personnel 6 90 6 0 13 1 2 1 119 O 4.857226e-17 28.672450 1.927560 3.355937 6.872718 4.779123 -1
1588 Kuwait 0 1 0 0 14 3 101 0 119 LOC 6.245005e-17 32.857410 2.208901 3.492177 9.105708 4.779123 -1
1408 ##amo 3 1 0 0 0 0 114 1 119 O 2.081668e-17 37.478119 2.519537 3.623757 12.422848 4.779123 -1
1631 losses 4 99 6 0 3 2 2 1 117 O 2.081668e-17 31.937194 2.183740 3.463771 8.879451 4.762174 -1
1629 ##tus 0 98 1 0 0 1 15 0 115 O 2.081668e-17 31.972400 2.224167 3.464873 9.245778 4.744932 -1
1813 ##mpo 3 90 2 4 1 4 6 3 113 O 1.387779e-17 28.711659 2.032684 3.357303 7.634549 4.727388 -1
1471 bartender 0 109 0 0 0 0 3 1 113 O 2.775558e-17 35.872822 2.539669 3.579980 12.675473 4.727388 -1
1569 Libya 0 2 0 0 0 1 102 7 112 LOC 6.938894e-18 33.335417 2.381101 3.506620 10.816808 4.718499 -1
1465 Jed 0 0 0 0 0 1 109 0 110 O 6.245005e-17 36.002604 2.618371 3.583591 13.713369 4.700480 -1
1896 Islam 1 2 1 16 1 2 85 0 108 MISC 6.245005e-17 27.463612 2.034342 3.312862 7.647216 4.682131 -1
1740 Eva 1 92 0 0 2 0 3 8 106 O 6.938894e-17 29.869508 2.254302 3.396838 9.528645 4.663439 -1
1670 ##yang 1 95 0 0 0 0 0 7 103 O 7.632783e-17 31.122490 2.417281 3.437931 11.215321 4.634729 -1
2084 crossing 9 78 1 1 4 2 5 2 102 O 4.163336e-17 24.787850 1.944145 3.210354 6.987655 4.624973 -1
2006 holy 9 0 0 0 3 2 80 8 102 O 3.469447e-17 25.635669 2.010641 3.243985 7.468100 4.624973 -1
1935 ##ibly 4 83 3 0 3 4 1 0 98 O 6.938894e-18 26.785024 2.186533 3.287843 8.904284 4.584967 -1
1954 investors 7 82 0 0 2 5 2 0 98 O 4.857226e-17 26.470502 2.160857 3.276031 8.678575 4.584967 -1
1628 Munich 97 0 0 0 0 0 0 0 97 LOC 6.245005e-17 32.079735 2.645751 3.468225 14.094030 4.574711 -1
1989 prayer 2 3 2 0 2 4 80 1 94 O 2.775558e-17 25.820292 2.197472 3.251161 9.002224 4.543295 -1
2012 Mao 0 16 78 0 0 0 0 0 94 PER 6.938894e-18 25.581976 2.177189 3.241888 8.821478 4.543295 -1
1944 prohibition 0 82 0 4 6 1 0 0 93 O 3.469447e-17 26.683035 2.295315 3.284028 9.927561 4.532599 -1
1689 Scheme 0 93 0 0 0 0 0 0 93 O 6.245005e-17 30.756859 2.645751 3.426113 14.094030 4.532599 -1
1964 foe 0 81 2 0 4 0 5 0 92 O 4.857226e-17 26.334388 2.289947 3.270876 9.874412 4.521789 -1
2045 Arab 3 1 5 1 2 2 78 0 92 MISC 6.938894e-18 25.174392 2.189078 3.225827 8.926975 4.521789 -1
1946 ##rting 0 82 2 0 3 2 3 0 92 O 2.081668e-17 26.673957 2.319475 3.283688 10.170329 4.521789 -1
1792 Taipei 0 88 0 0 0 0 0 2 90 LOC 4.857226e-17 29.016159 2.579214 3.367853 13.186771 4.499810 -1
1893 Islamic 0 0 1 3 1 1 84 0 90 O 6.938894e-18 27.512497 2.445555 3.314640 11.536954 4.499810 -1
2101 Spring 0 75 1 0 0 0 12 0 88 LOC 6.245005e-17 24.500000 2.227273 3.198673 9.274537 4.477337 -1
2047 Beijing 0 77 1 0 0 0 0 9 87 LOC 6.938894e-17 25.161665 2.313716 3.225322 10.111934 4.465908 -1
2107 Germans 75 0 0 1 2 0 3 3 84 MISC 4.163336e-17 24.407990 2.324571 3.194911 10.222289 4.430817 -1
2261 immunity 0 2 4 1 0 1 70 6 84 O 3.469447e-17 22.572107 2.149724 3.116715 8.582493 4.430817 -1
2097 terminate 1 75 0 0 0 0 7 0 83 O 3.469447e-17 24.530275 2.364364 3.199908 10.637269 4.418841 -1
2136 Hamburg 74 0 4 1 0 0 0 4 83 LOC 4.163336e-17 24.103617 2.323240 3.182362 10.208699 4.418841 -1
1966 Colombia 1 0 2 0 0 80 0 0 83 LOC 6.938894e-18 26.324596 2.537310 3.270504 12.645615 4.418841 -1
1986 tavern 0 79 0 1 0 0 0 2 82 O 2.775558e-17 25.993990 2.535999 3.257865 12.629041 4.406719 -1
1917 HK 0 82 0 0 0 0 0 0 82 LOC 6.245005e-17 27.118951 2.645751 3.300233 14.094030 4.406719 -1
2072 sorting 1 76 0 0 1 0 3 0 81 O 6.938894e-18 24.917050 2.460943 3.215552 11.715857 4.394449 -1
2161 ##erman 73 2 1 2 1 0 0 1 80 O 2.775558e-17 23.822258 2.382226 3.170620 10.828980 4.382027 -1
2036 Istanbul 2 0 0 77 0 0 1 0 80 LOC 3.469447e-17 25.332785 2.533279 3.232099 12.594730 4.382027 -1
1977 Ain 0 0 79 0 0 0 0 0 79 O 6.938894e-18 26.126794 2.645751 3.262961 14.094030 4.369448 -1
2440 Arabian 3 1 2 3 0 2 64 3 78 MISC 3.469447e-17 20.528943 2.105533 3.021836 8.211475 4.356709 -1
2488 monsters 5 2 62 0 6 0 2 0 77 O 3.469447e-17 19.911915 2.068770 2.991318 7.915085 4.343805 -1
2474 neighbour 62 2 0 0 9 3 0 0 76 O 3.469447e-17 20.049938 2.110520 2.998226 8.252529 4.330733 -1
2257 ##mission 1 69 0 0 0 1 1 1 73 O 3.469447e-17 22.635357 2.480587 3.119513 11.948276 4.290459 -1
2506 ##PM 0 61 2 0 0 0 6 3 72 O 6.938894e-18 19.754746 2.194972 2.983394 8.979748 4.276666 -1
2627 ##zone 58 2 1 0 1 9 1 0 72 O 3.469447e-17 18.721645 2.080183 2.929680 8.005932 4.276666 -1
2251 Wu 2 69 0 0 0 0 0 1 72 PER 8.326673e-17 22.688103 2.520900 3.121841 12.439791 4.276666 -1
2500 mosque 0 1 0 7 1 1 61 0 71 O 6.938894e-18 19.820680 2.233316 2.986726 9.330757 4.262680 -1
2545 Constitution 1 60 0 0 5 4 0 1 71 O 4.163336e-17 19.406426 2.186640 2.965604 8.905237 4.262680 -1
2756 therapist 1 7 1 1 1 2 55 2 70 O 5.551115e-17 17.583728 2.009569 2.866974 7.460101 4.248495 -1
2544 Grandpa 2 60 0 0 0 3 4 1 70 PER 4.857226e-17 19.421316 2.219579 2.966371 9.203455 4.248495 -1
2395 Peak 0 64 1 0 0 0 3 1 69 O 4.163336e-17 20.951954 2.429212 3.042232 11.349935 4.234107 -1
2753 Cat 3 55 2 2 0 2 2 2 68 O 2.775558e-17 17.592612 2.069719 2.867479 7.922597 4.219508 -1
2302 Lok 0 67 0 0 0 0 1 0 68 ORG 9.020562e-17 22.113344 2.601570 3.096181 13.484892 4.219508 -1
2817 sphere 7 1 0 0 53 2 3 0 66 O 1.387779e-17 17.056890 2.067502 2.836554 7.905050 4.189655 -1
2524 Cafe 0 60 0 2 0 1 1 2 66 O 2.081668e-17 19.575176 2.372749 2.974262 10.726835 4.189655 -1
2738 urine 0 55 1 1 0 1 6 2 66 O 6.245005e-17 17.760560 2.152795 2.876980 8.608888 4.189655 -1
2875 Soviet 2 2 5 0 52 0 0 5 66 MISC 1.387779e-17 16.648949 2.018054 2.812347 7.523673 4.189655 -1
2614 Consul 2 58 2 0 1 0 1 1 65 O 9.020562e-17 18.864235 2.321752 2.937268 10.193518 4.174387 -1
2644 Library 0 57 0 1 1 0 2 3 64 ORG 6.245005e-17 18.547237 2.318405 2.920321 10.159453 4.158883 -1
2640 unification 3 0 1 0 3 0 0 57 64 O 0.000000e+00 18.560711 2.320089 2.921047 10.176579 4.158883 -1
2442 Augsburg 62 0 0 0 0 0 0 0 62 LOC 6.245005e-17 20.504573 2.645751 3.020648 14.094030 4.127134 -1
2933 Shanghai 9 50 0 0 2 0 0 0 61 LOC 3.469447e-17 16.278341 2.134864 2.789835 8.455899 4.110874 -1
2722 ##ogen 0 55 2 0 0 0 3 1 61 O 2.775558e-17 17.936956 2.352388 2.886863 10.510635 4.110874 -1
3082 Industry 3 48 2 0 3 1 3 0 60 O 4.163336e-17 15.354153 2.047220 2.731386 7.746339 4.094345 -1
2625 Venezuela 0 0 0 0 3 57 0 0 60 LOC 6.938894e-18 18.734994 2.497999 2.930393 12.158144 4.094345 -1
3076 ##wl 5 48 0 0 2 1 3 1 60 O 4.163336e-17 15.386683 2.051558 2.733502 7.780010 4.094345 -1
2792 ion 1 53 1 0 0 0 4 1 60 O 1.387779e-17 17.240940 2.298792 2.847287 9.962140 4.094345 -1
2889 Tommy 1 51 0 0 0 2 5 0 59 PER 4.857226e-17 16.567570 2.246450 2.807447 9.454116 4.077537 -1
2534 Frankfurt 59 0 0 0 0 0 0 0 59 LOC 6.245005e-17 19.512416 2.645751 2.971051 14.094030 4.077537 -1
2993 Line 0 49 5 0 1 1 0 1 57 O 3.469447e-17 15.901553 2.231797 2.766417 9.316592 4.043051 -1
2692 cycling 55 0 0 1 0 0 0 0 56 O 3.469447e-17 18.145247 2.592178 2.898409 13.358838 4.025352 -1
2774 ##won 1 0 1 0 0 0 0 53 55 O 0.000000e+00 17.438732 2.536543 2.858694 12.635911 4.007333 -1
3293 sergeant 1 0 5 0 2 2 0 44 54 O 0.000000e+00 14.166422 2.098729 2.650874 8.155798 3.988984 -1
3193 Cheung 0 45 0 0 0 0 8 0 53 PER 7.632783e-17 14.738873 2.224736 2.690488 9.251036 3.970292 -1
3090 brushes 1 47 0 0 2 0 0 3 53 O 6.938894e-17 15.296548 2.308913 2.727627 10.063478 3.970292 -1
2763 ý 0 0 0 53 0 0 0 0 53 O 6.938894e-18 17.528102 2.645751 2.863805 14.094030 3.970292 -1
2962 Grandma 3 49 0 0 0 0 0 0 52 PER 6.245005e-17 16.093477 2.475920 2.778414 11.892638 3.951244 -1
3070 broadband 0 47 0 0 0 1 1 1 50 O 4.163336e-17 15.409007 2.465441 2.734952 11.768673 3.912023 -1
2897 Libyan 0 0 0 0 0 0 50 0 50 MISC 6.938894e-18 16.535946 2.645751 2.805537 14.094030 3.912023 -1
3537 ##IP 1 1 6 0 0 2 40 0 50 O 6.245005e-17 12.891373 2.062620 2.556558 7.866551 3.912023 -1
3266 Nam 1 0 3 0 1 1 0 44 50 O 5.551115e-17 14.298164 2.287706 2.660131 9.852313 3.912023 -1
3260 ##á 0 0 0 0 3 0 44 3 50 O 2.081668e-17 14.324367 2.291899 2.661962 9.893706 3.912023 -1
2895 Allah 0 0 0 0 0 0 50 0 50 PER 6.938894e-18 16.535946 2.645751 2.805537 14.094030 3.912023 -1
3541 Temple 1 40 4 0 3 0 1 0 49 LOC 6.938894e-18 12.878640 2.102635 2.555570 8.187717 3.891820 -1
3121 Chairman 0 46 0 0 0 0 2 0 48 O 7.632783e-17 15.132746 2.522124 2.716861 12.455027 3.871201 -1
3452 ##world 0 1 1 0 2 2 0 41 47 O 5.551115e-17 13.298849 2.263634 2.587677 9.617976 3.850148 -1
3117 Babe 0 0 0 0 0 1 0 46 47 O 0.000000e+00 15.169356 2.582018 2.719277 13.223798 3.850148 -1
3695 Hans 3 2 38 0 1 1 1 1 47 O 6.938894e-18 12.170020 2.071493 2.498976 7.936661 3.850148 -1
3301 Lebanese 0 3 0 0 0 0 43 0 46 MISC 6.245005e-17 14.113380 2.454501 2.647123 11.640622 3.828641 -1
3167 ##ý 0 0 0 45 0 0 0 0 45 O 6.938894e-18 14.882351 2.645751 2.700176 14.094030 3.806662 -1
3166 Brandenburg 45 0 0 0 0 0 0 0 45 LOC 6.245005e-17 14.882351 2.645751 2.700176 14.094030 3.806662 -1
3426 lantern 0 41 1 0 0 0 2 0 44 O 2.775558e-17 13.435029 2.442733 2.597865 11.504434 3.784190 -1
3420 Russians 0 1 0 0 41 0 1 0 43 MISC 3.469447e-17 13.471614 2.506347 2.600585 12.260059 3.761200 -1
3648 spheres 1 1 0 0 38 1 1 0 42 O 4.857226e-17 12.386989 2.359427 2.516647 10.584879 3.737670 -1
3907 ##words 3 1 0 0 0 1 1 35 41 O 0.000000e+00 11.329580 2.210650 2.427417 9.121641 3.713572 -1
In [1026]:
# Rough percentage of token types to be masked
print('number of token types masked: ', sum(f2['Mask'] == -1), '\n',
      'approx proportion of token types masked: ', sum(f2['Mask'] == -1)/len(f2['Mask']))
number of token types masked:  220 
 approx proportion of token types masked:  0.009435176051807694
In [1318]:
px.pie(f2,
       values='Total',
       names='NE',
       title='Percentage of Token Types per NE group'
      )
In [1319]:
px.pie(f2,
       values='Total',
       names='Mask',
       title='Percentage of Overall Tokens Masked'
      )
In [1320]:
px.pie(f2.loc[f2['Mask'] == -1],
       values='Total',
       names='NE',
       title='Masked Tokens by NE Group'
      )
In [1321]:
px.pie(f2.loc[f2['Mask'] == 0],
       values='Total',
       names='NE',
       title='Percentage of Unmasked Tokens by NE Group'
      )
In [1322]:
b = pd.DataFrame(f2.loc[f2['Mask'] == -1]['Target'].sum(axis=0), columns = ['Frequency'])
px.pie(b,
       values='Frequency',
       names=b.index,
       title='Masked Tokens per Target Group'
      )

The pie chart above illustrates that the dbscan mask does not solve the original problem and might actually increase the bias through masking. It is a shocking discovery in itself that 64 percent of the masked tokens appear in the Chinese samples, although Chinese samples represent less than 30 percent of the corpus overall, and the text lengths are not disproportionately longer than the other target groups. The reasons for this are beyond the scope of this research, but might be worth investigating from another perspective, especially if the phenomenon is due to named entities rather than topic imbalance.

In [1314]:
c = pd.DataFrame(f2.loc[f2['Mask'] == 0]['Target'].sum(axis=0), columns = ['Frequency'])
px.pie(c,
       values='Frequency',
       names=c.index,
       title='Unmasked Tokens by Target Group'
      )
In [1313]:
d = pd.DataFrame(f2['Target'].sum(axis=0), columns = ['Frequency'])
px.pie(d,
       values='Frequency',
       names=d.index,
       title='Total Tokens by Target Group'
      )

As you can see, the line of demarcation for the masked tokens follows a general formula CV_exp = b/(Total_log-c)+ a, where a, b, and c are parameters of the equation to be optimized, representing the horizontal and vertical displacement, as well as the sharpness of the curve.

To approximate optimum values for these three parameters, I am using the algorithm below:

  1. generate equally distributed combinations of all 3 parameters (a, b, c) across their appropriate ranges. Because the operations are computationally cumbersome, 4^3 = 64 value combinations will be used for each step.

  2. For each combination, calculate the proposed mask and use the following as a 'loss' function: Loss = mean_CV_unmasked + mean_rel_freq + CV_mask

  3. Locate the parameter value set which produced the minimum loss.

  4. Generate 4^3 new values nested between the old values at this location.

  5. Repeat 2.

Continue this process until the new loss and the old loss reach some threshold, say a ratio of 0.99 new_loss/old_loss.

In [ ]:
def get_mean_CV(dataframe):
    
    return mean_CV

def get_mean_rel_freq(dataframe):
    dataframe
    return mean_rel_freq

def get_CV(dataframe, token_name):
    
    return CV
In [1292]:
# initial test values

a = np.linspace(f2['CV_exp'].min(), f2['CV_exp'].max(), num=4)
c = np.linspace(f2['Total_log'].min(), f2['Total_log'].max(), num=4)
b = np.linspace(0.01, 3.01, num=4)

#loss
def get_mask(a, b, c):
    mask = pd.DataFrame(index=f2.index, columns = ['Mask'])
    for token in f2.index:
        val = b / (f2.loc[token, 'Total_log'] - c)+ a
        if (f2.loc[token, 'CV_exp'] > val).item():
            mask.loc[token, 'Mask'] = 0
        else:
            mask.loc[token, 'Mask'] = 1
    mask = mask.squeeze()
    return mask
    
def get_loss(mask):
    
    unmasked_cv = f2['CV'].loc[mask==1]
    unm_cv_mean = np.mean(unmasked_cv)
    
    masked_rel_freq = f2['Total'].loc[mask==0].sum()/f2['Total'].sum()
    
    freq_mask = f2.loc[mask==0]
    
    cv_mask = get_CV(freq_mask, 'Target')
    
    loss = unm_cv_mean + masked_rel_freq + cv_mask
    
    return loss

def get_test_vals(a, b, c):
    #mask = pd.DataFrame(data = None, index = df.index, columns = ['Mask'])
    losses = pd.DataFrame(data= None, index = None, columns = ['a', 'b', 'c', 'loss'])
    
    for i in a:
        for j in b:
            for k in c:
                combo = pd.DataFrame(data = {'a': i, 
                                    'b': j,
                                    'c': k,
                                    'loss': None
                                    },
                                    index=pd.Series(0))
                mask = get_mask(i, j, k)
                loss = get_loss(mask)
                combo['loss'] = loss
                losses = pd.concat([losses, combo])
    losses = losses.reset_index(drop=True)
    return losses
In [ ]:
loss_matrix.to_csv("mask_loss_matrix.csv")

The idea is to do a few iterations of this, zooming into areas thought to contain a minimum loss value. I have to clean up the coding, which I will do in the next step of the project. The rest of the code below can be ignored for now, since it is only formative.

In [ ]:
def find_local_min(a, b, c):
    losses = get_test_vals(a, b, c)
    min_ind = np.where(losses['loss']==losses['loss'].min())
    x = losses.loc[min_ind, 'a']
    y = losses.loc[min_ind, 'b']
    z = losses.loc[min_ind, 'c']
    
In [1268]:
loss_matrix = get_test_vals(a,b,c)
In [1293]:
l = get_test_vals([1, 0.3], [0.5, 0.8], [2, 0.3])
In [1305]:
min_ind = np.where(l['loss']==l['loss'].min())
In [1312]:
c = [2, 0.3]
c[c==l.loc[min_ind, 'c']]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [1312], in <cell line: 2>()
      1 c = [2, 0.3]
----> 2 c[c==l.loc[min_ind, 'c']]

File ~/miniforge3/envs/pyto/lib/python3.9/site-packages/pandas/core/ops/common.py:70, in _unpack_zerodim_and_defer.<locals>.new_method(self, other)
     66             return NotImplemented
     68 other = item_from_zerodim(other)
---> 70 return method(self, other)

File ~/miniforge3/envs/pyto/lib/python3.9/site-packages/pandas/core/arraylike.py:40, in OpsMixin.__eq__(self, other)
     38 @unpack_zerodim_and_defer("__eq__")
     39 def __eq__(self, other):
---> 40     return self._cmp_method(other, operator.eq)

File ~/miniforge3/envs/pyto/lib/python3.9/site-packages/pandas/core/series.py:5623, in Series._cmp_method(self, other, op)
   5620 rvalues = extract_array(other, extract_numpy=True, extract_range=True)
   5622 with np.errstate(all="ignore"):
-> 5623     res_values = ops.comparison_op(lvalues, rvalues, op)
   5625 return self._construct_result(res_values, name=res_name)

File ~/miniforge3/envs/pyto/lib/python3.9/site-packages/pandas/core/ops/array_ops.py:260, in comparison_op(left, right, op)
    255 if isinstance(rvalues, (np.ndarray, ABCExtensionArray)):
    256     # TODO: make this treatment consistent across ops and classes.
    257     #  We are not catching all listlikes here (e.g. frozenset, tuple)
    258     #  The ambiguous case is object-dtype.  See GH#27803
    259     if len(lvalues) != len(rvalues):
--> 260         raise ValueError(
    261             "Lengths must match to compare", lvalues.shape, rvalues.shape
    262         )
    264 if should_extension_dispatch(lvalues, rvalues) or (
    265     (isinstance(rvalues, (Timedelta, BaseOffset, Timestamp)) or right is NaT)
    266     and not is_object_dtype(lvalues.dtype)
    267 ):
    268     # Call the method on lvalues
    269     res_values = op(lvalues, rvalues)

ValueError: ('Lengths must match to compare', (1,), (2,))
In [1270]:
px.scatter_3d(data=loss_matrix, x='a', y='b', z = 'c', color='loss')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [1270], in <cell line: 1>()
----> 1 px.scatter_3d(data=loss_matrix, x='a', y='b', z = 'c', color='loss')

TypeError: scatter_3d() got an unexpected keyword argument 'data'
In [ ]:
loss_matrix.loc[loss_matrix['loss'].min()]

References

Brezina, V. (2018). Vocabulary: Frequency, Dispersion and Diversity. In Statistics in Corpus Linguistics: A Practical Guide (pp. 38-65). Cambridge: Cambridge University Press. doi:10.1017/9781316410899.003

The University of Pittsburgh English Language Institute Corpus (PELIC). (2022). PELIC. https://eli-data-mining-group.github.io/Pitt-ELI-Corpus/

Huang, Y., Murakami, A., Alexopoulou, T., & Korhonen, A. (2018). Dependency parsing of learner English. International Journal of Corpus Linguistics, 23(1), 28-54.

Geertzen, J. , Alexopoulou, T., & Korhonen, A. (2013). Automatic linguistic annotation of large scale L2 databases: The EF-Cambridge Open Language Database (EFCAMDAT). Selected Proceedings of the 31st Second Language Research Forum (SLRF), Cascadilla Press, MA.

In [ ]: